Abstract

We describe a state-space tracking approach based on a
Conditional Random Field (CRF) model, where the observation
potentials are learned from data. We find functions that embed
both state and observation into a space where similarity
corresponds to $L_1$ distance, and define an observation
potential based on distance in this space. This potential is
extremely fast to compute and, in conjunction with a
grid-filtering framework, can be used to reduce a continuous
state estimation problem to a discrete one. We show how a state
temporal prior in the grid filter can be computed in a manner
similar to a sparse HMM, resulting in real-time system
performance. The resulting system is used for human pose
tracking in video sequences.

1  Introduction 

Tracking articulated objects (such as humans) is an example of
state estimation in a high-dimensional space with a non-linear
observation model that has been a focus of considerable research
attention. The combination of frequent self-occlusion and
unobservable degrees of freedom with the large volume of the
pose space makes probabilistic methods appealing. The vast
majority of probabilistic articulated tracking methods are based
on a generative model formulation.

Current state-of-the-art generative tracking algorithms use
non-parametric density estimators, such as particle filters, due
to their ability to model arbitrary multimodal distributions
[18, 10]. Unfortunately, several properties conspire to make
particle filtering extremely computationally intensive. On one
hand, a large number of particles is needed in order to
faithfully model the distributions in question. On the other
hand, a complex likelihood function needs to be evaluated for
each particle at every iteration of the algorithm. A further
drawback of generative-model-based algorithms is that the
likelihood function is too complicated to be learned from data
and is usually specified in an ad-hoc fashion. Recently, the use
of directed discriminative models with parameters learned
directly from data has been proposed [1, 27, 22].

In  this  work  we  pose  state  estimation  as  inference  in  an 
undirected  Conditional  Random  Field  model  (CRF)  [17]. 
This  allows  us  to  replace  the  likelihood  function  with  a 
more  general  observation  potential  (compatibility)  function 
that  can  be  automatically  learned  from  training  data.  These 
functions  might  be  expensive  to  evaluate  in  general,  but  can 
be  made  efficient  at  run-time  if  all  state  (pose)  values  at 
which  they  can  be  evaluated  are  known  in  advance.  In  this 
case  much  of  the  computation  can  be  performed  off-line, 
thus  greatly  reducing  run-time  complexity. 

This algorithm naturally operates on a discrete set of samples,
and we will show how we can estimate the posterior probability
in a continuous state space using grid filtering methods. The
idea underlying these methods is that if the state space can be
partitioned into regions that are sufficiently small, then the
posterior can be well approximated by a constant function within
each region.

The direct application of grid filtering would, of course,
require evaluating the potential function in every region of the
partition, which is impossible to do in reasonable time even
with a fast implementation. Fortunately this is not necessary,
since at a particular time step the prior state probability is
negligible in the vast majority of the regions, allowing us to
concentrate only on locations with a significant prior.

Our algorithm operates in a standard predict-update framework:
at every step we first estimate the temporal prior probability
of the state being in each of the regions in the partition. We
then evaluate the observation potential only for regions with
non-negligible prior. When the set of cells is fixed, we can
precompute the transition probabilities between cells, and thus
reduce the temporal prior computation to a single sparse
matrix/vector multiplication, in a manner similar to HMMs [20],
thus avoiding a sampling step altogether.

After reviewing related prior work, we first describe the
CRF-based tracking formulation and describe a way to learn a
particular observation potential function based on image
embedding (Section 3). We then discuss a grid-filter-based
inference method which can be realized with a sparse HMM
computation (Section 4). The results of our method are
demonstrated and compared against competing algorithms in
Section 5.

2  Prior  Work 

Probabilistic articulated pose estimation is often approached
using state-space methods. The majority of the approaches have
been based on a generative model formulation, with varying
assumptions about the forms of the pose distribution and
transition probabilities. Early methods [14, 21] assumed that
both were Gaussian and used Kalman filtering. Extended and
Unscented [28] Kalman filters enabled modeling of non-linear
transitions, but still constrained the pose distribution to be
Gaussian. These methods required a relatively small number of
evaluations of the likelihood function, but lost track due to
restrictive distribution models.

The need to relax the unimodality assumption led first to the
use of mixture models [11, 5], and then to Monte-Carlo methods
that represent distributions with sets of discrete samples
(particles) [18, 10, 25, 26]. While theoretically sound,
particle filtering methods are not very successful in high
dimensions [15] - they require large numbers of particles to
faithfully represent the distribution, which entails large
computational costs of likelihood evaluation. Furthermore, the
emission probability model used in likelihood evaluation is very
expensive to train, and is often hand-designed in an ad-hoc
fashion.

Several discriminative methods have been proposed for visual
pose tracking. These algorithms apply various regression
techniques while leveraging a large number of annotated image
sequences. For example, one [1], or a mixture [27], of simple
experts was trained to predict the current pose based on the
past pose estimates and the current observation. Robust
regression combined with fast nearest neighbor search was used
for single-frame pose estimation in [23].

In this paper we dispense with directed models altogether and
opt for a Conditional Random Field (CRF) [17] model. The main
advantage of this model over generative models is that CRFs do
not require specification (and evaluation) of the emission
probability, but only similarity between state and
observation(s). CRFs are also a more flexible model than the
previously proposed regression methods. They allow for modeling
the relationship between the state and an arbitrary subset of
observations. They are also better able to adjust to sequences
not appearing in the training data. For example, the MEMM model
(similar to the one used in [27]) has been shown to be subject
to the label bias problem [17].

While in the present work we use a simple chain-structured CRF
(Figure 1(b)), which directly models the dependency between
concurrent state and observation, it can be extended by
introducing more general relationships between state and
observations.

We learn the observation potential function for our model using
the parameter sensitive embedding introduced in [23]. This
algorithm allows us to learn a transformation of images of
humans into a space where the distance between embeddings of two
images is likely to be small if the poses are similar and large
otherwise. The observation potential of a particular pose is
then determined by the distance between embeddings of the
rendering of the figure in this pose and the observed image.

If for every pose at which we would like to evaluate the
potential we had to render the corresponding image, our method
would be extremely slow. By discretizing the continuous pose
space, we are able to precompute the embeddings of all discrete
poses off-line, thus drastically reducing run-time complexity.
Fixing the set of poses at which the observation potential can
be computed would seem to be an unreasonable restriction, since
we are operating in a continuous pose space, but we overcome
this problem by using a variant of the grid-filtering technique
[6, 2].

The main idea underlying grid filtering is that a sufficiently
finely discretized random variable is indistinguishable from a
continuous one. That is, if the distribution can be approximated
by a piece-wise constant function, then it is sufficient to
evaluate it only at one point in every “constancy region” (cell)
[6]. This reduces a continuous estimation problem to a discrete
one (albeit with a very large number of discrete points). We
show that in the case where both the observation potential and
the temporal prior are constant in each cell, tracking can be
formulated as state estimation in an HMM framework, allowing us
to use existing inference algorithms; further, we have found in
practice that a manageable number of cells suffices for
realistic tracking tasks.

3  Tracking  with  Conditional  Random  Fields

Figure 1(a) shows the dynamic generative model that is commonly
used in tracking applications. The state (pose¹) at time t is
denoted as $\theta^t$, and the observed image as $I^t$. The full
model is specified by the initial distribution $p(\theta^0)$,
the transition probability model $p(\theta^t|\theta^{t-1})$, and
the emission distribution $p(I^t|\theta^t)$. This model
describes the joint probability of the state(s) and
observation(s)

$$p(\theta^{0..T}, I^{1..T}) = p(\theta^0) \prod_{t=1}^{T} \left[ p(\theta^t|\theta^{t-1})\, p(I^t|\theta^t) \right],$$

from which appropriate conditional distributions of the pose
parameters can be derived.

1 In this work we consider only first-order Markov models of motion.

Figure 1: Chain-structured generative (a), CRF (b), and MEMM (c)
tracking models. In all models the state of the object (pose) at
time t is specified by $\theta^t$, and the observed image by
$I^t$. The generative model is described by the transition
probability $p(\theta^t|\theta^{t-1})$ and the emission
probability $p(I^t|\theta^t)$. The CRF model is described by the
motion compatibility (potential) function
$\phi(\theta^t, \theta^{t-1})$ and the image compatibility
function $\phi^t(\theta^t) = \phi(I^t, \theta^t)$. Note the
contrast with the MEMM model [27], specified by the conditional
distribution $p(\theta^t|\theta^{t-1}, I^t)$ as shown in (c).

While reasonable approximations can be constructed for the
transition probability, $p(\theta^t|\theta^{t-1})$, the problem
for generative models lies in specifying the emission model
$p(I^t|\theta^t)$. In practice, to evaluate the likelihood
function at a particular pose, a figure in this pose is first
rendered as an image, and this image is then compared with the
observation using a certain metric [13]. Evaluating the
likelihood thus becomes computationally expensive.

The major difference between generative-model-based approaches
and ours is that we formulate pose estimation as inference in a
Conditional Random Field (CRF) model, and are able to learn
compact and efficient observation and transition potentials from
data.

A chain version of a CRF is shown in Figure 1(b). While, apart
from the lack of arrows, it is quite similar to the generative
model, the underlying computations are quite different. This
model is specified by the motion potential
$\phi(\theta^t, \theta^{t-1})$ and the observation potential
$\phi^t(\theta^t) = \phi(I^t, \theta^t)$. The observation
potential function is the measure of compatibility between the
latent state and the observation. Of course, one choice for it
might be the generative model’s emission probability
$p(I^t|\theta^t)$, but this does not have to be the case. It can
be modeled by any function that is large when the latent pose
corresponds to the one observed in the image and small
otherwise.

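Rather than modeling the joint distribution of poses and
observations, the CRF directly models the distribution of poses
conditioned on observation,

$$p(\theta^{0..T} \mid I^{1..T}) = \frac{1}{Z}\, p(\theta^0) \prod_{t=1}^{T} \phi(\theta^t, \theta^{t-1})\, \phi^t(\theta^t),$$

where Z is a normalization constant.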


Once the observation potential is defined, a chain-structured
CRF² can be used to perform on-line tracking

$$p(\theta^t \mid I^{1..t}) \propto \phi^t(\theta^t) \int \phi(\theta^t, \theta^{t-1})\, p(\theta^{t-1} \mid I^{1..t-1})\, d\theta^{t-1}. \qquad (1)$$

The main advantage of this model from our standpoint is that the
observation potential $\phi^t(\theta^t)$ may be significantly
simpler to learn and faster to evaluate than the emission
probability $p(I^t|\theta^t)$. Below we describe a model of such
a potential based on similarity between images.

Suppose that we can measure the similarity $S$ such that, given
two images $I_a$ and $I_b$ with underlying poses $\theta_a$ and
$\theta_b$, respectively, $S(I_a, I_b)$ is with high probability
small if $d_\theta(\theta_a, \theta_b)$ is small, and large
otherwise.³ Suppose now that we are interested in evaluating the
potential $\phi(I^t, \theta)$, and that we have access to an
image $I_\theta$ that corresponds to the pose $\theta$ (for
instance, we can render it using computer graphics). Then, we
can define the observation potential based on distance in the
image embedding space:

$$\phi(I^t, \theta) = \mathcal{N}(S(I^t, I_\theta);\, 0, \sigma^2). \qquad (2)$$

In this work, we follow the approach in [23] for learning a
binary embedding $H(I)$ of images such that the $L_1$ distance
in the $H$ space serves as a proxy for such a pose-sensitive
similarity $S$. Briefly, the learning algorithm is based on
formulating a classification problem on image pairs
(similar/dissimilar), and constructing an embedding based on a
labeled training set of such pairs.

Once the desired M-dimensional embedding
$H = [h_1(I), \ldots, h_M(I)]$ has been learned, the induced
similarity is the Hamming distance in $H$:
$S(I_a, I_b) = \sum_{m=1}^{M} |h_m(I_a) - h_m(I_b)|$.
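
To make Eq. (2) and the Hamming similarity concrete, here is a minimal sketch in Python. The function names and the bit-list representation are assumptions made for illustration; the source does not describe its implementation at this level, and the Gaussian is written with its usual normalizing constant.

```python
import math

def hamming_distance(bits_a, bits_b):
    """Similarity S: number of positions where two binary embeddings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("embeddings must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))

def observation_potential(bits_observed, bits_pose, sigma):
    """Zero-mean Gaussian of the embedding distance, in the spirit of Eq. (2).

    sigma is a free parameter here; the source does not state its value.
    """
    s = hamming_distance(bits_observed, bits_pose)
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-(s * s) / (2.0 * sigma * sigma)) / norm
```

The potential is largest when the two embeddings coincide and decays as more bits differ, which is the only property the CRF requires of it.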

This potential could conceivably be used in a continuous domain,
for example by using Monte Carlo methods in the CRF framework,
as it captures features relevant to pose estimation better than
generic image similarity. Unfortunately it would not reduce
computational cost since it would require rendering the image
$I_\theta$ at runtime for every pose $\theta$ which we would
like to evaluate.

2 While both transition and observation potentials in a CRF are often
trained jointly, it is possible to train them separately, as we do in this case.

3 Here $d_\theta$ stands for the appropriate distance in pose space.
This approach becomes particularly efficient when we have a
finite (albeit large) set of possible pose hypotheses
$\theta_1, \ldots, \theta_N$. In such a case we can render an
image $I_i$ for each pose in the set, and compute its embedding
$H(I_i)$. The only calculation required at runtime is computing
the embedding $H(I^t)$ and calculating the Hamming distances
between the bit vectors. We capitalize on this efficiency in the
grid-filtering framework described in the next section.
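
The split between off-line and run-time work described above can be sketched as follows. This is an illustration under assumptions: bit vectors are packed into Python integers so that a Hamming distance becomes an XOR followed by a bit count, and the helper names `pack_bits` and `distances_to_grid` are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    value = 0
    for b in bits:
        value = (value << 1) | (1 if b else 0)
    return value

def distances_to_grid(observed_code, grid_codes):
    """Hamming distance from one observed embedding to every precomputed pose embedding."""
    return [bin(observed_code ^ code).count("1") for code in grid_codes]

# Off-line: embed every rendered pose image once (toy 4-bit embeddings here).
grid_codes = [pack_bits(b) for b in ([1, 0, 1, 1], [0, 0, 0, 0], [1, 1, 1, 0])]

# Run-time: embed the observed frame once, then compare against the whole grid.
observed = pack_bits([1, 0, 1, 1])
distances = distances_to_grid(observed, grid_codes)
```

Only the last two lines correspond to per-frame work; everything that depends on rendering is done before tracking starts.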

4  Grid  Filtering 

In the previous section we proposed a CRF tracking framework
where the observation potential is computed as the distance
between embeddings of state and observation. Computing this
potential for an arbitrary pose and image is relatively slow
since it would involve rendering an image of a person in this
pose and then computing the embedding. This is part of the
problem with generative-model-based tracking which we wanted to
avoid.

Fortunately, if all of the poses where the observation potential
is to be evaluated are known in advance, then we can precompute
the appropriate embedding off-line, drastically reducing runtime
evaluation cost. We would then compute a single embedding for
the observed image, which would be amortized when the potential
is evaluated at multiple poses.

While fixing the poses in advance seems too restrictive for
continuous space inference, grid-based techniques pioneered by
[4, 16] show that this can be a profitable approximation. The
main idea underlying these methods is that many functions of
interest can be approximated by piece-wise constant functions,
if the region of support for each constant “piece” is small
enough. As mentioned above, we follow the convention and denote
such a region of support as a “cell”.

In our case, the function we are interested in is the posterior
probability of the pose conditioned on all previously seen
observations (including the current one). The posterior is
proportional to the product of the temporal prior (the pose
probability based on the estimate at the previous time step and
the motion model) and the observation potential. We would like
to define the cells such that both of the components are almost
constant. The observation potential is often sharply peaked, so
the cells should be small in the regions of pose space where we
expect large appearance variations, but large in other regions.
On the other hand the motion models are usually (and our work is
no exception) very approximate and compensate for it by inflated
dynamic noise. Thus the temporal prior is broad and should also
be approximately constant on cells small enough for observation
potential constancy. We derive the grid filter based on the
assumption that the partition of the pose space into cells with
the properties described above is available.


Let the space of all valid poses $\Theta$ be split into N
disjoint (and not necessarily regular) cells $C_i$,
$\Theta = \bigcup_{i=1}^{N} C_i$, $C_i \cap C_j = \emptyset$,
$i \ne j$, such that both likelihood and prior can be
approximated as constant within each cell. Furthermore, let us
have a sample $\theta_i \in C_i$ in every cell. The set of
sample points $\{\theta_i\}_i$ is referred to as the “grid” in
the grid-filtering framework.

By virtue of our assumptions, the temporal prior can be
expressed as

$$p(\theta^t \in C_i \mid \theta^{t-1} \in C_j) = \int_{C_i} \int_{C_j} \phi(\theta^t, \theta^{t-1})\, d\theta^{t-1}\, d\theta^t \approx [\ldots] \qquad (3)$$

where $|C_i|$ is the volume of the i-th cell, with the
approximation valid when the noise covariance in the transition
is much wider than the volume of the cell. So the (time
independent) transition probability from the j-th to the i-th
cell is

$$T_{ij} = [\ldots] \qquad (4)$$

The compatibility between the observation and the pose belonging
to a particular cell can be written as

$$\int_{C_i} \phi^t(\theta)\, d\theta \approx \phi^t(\theta_i)\, |C_i| \qquad (5)$$

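Combining Eqs. (1), (3), and (5), the posterior probability of
the pose being in the i-th cell is

$$\begin{aligned}
p(\theta^t \in C_i \mid I^{1..t}) &\approx \frac{1}{Z}\, \phi^t(\theta_i)\, |C_i| \sum_{j=1}^{N} T_{ij}\, p(\theta^{t-1} \in C_j \mid I^{1..t-1}) \\
&= \frac{1}{Z}\, \phi^t(\theta_i) \sum_{j=1}^{N} S_{ij}\, p(\theta^{t-1} \in C_j \mid I^{1..t-1}),
\end{aligned} \qquad (6)$$

where $S_{ij} = |C_i| T_{ij}$ is time independent and can be
computed offline.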

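If we denote

$$\pi^t = \begin{pmatrix} p(\theta^t \in C_1 \mid I^{1..t}) \\ p(\theta^t \in C_2 \mid I^{1..t}) \\ \vdots \\ p(\theta^t \in C_N \mid I^{1..t}) \end{pmatrix}
\quad \text{and} \quad
l^t = \begin{pmatrix} \phi^t(\theta_1) \\ \phi^t(\theta_2) \\ \vdots \\ \phi^t(\theta_N) \end{pmatrix},$$

then the posterior can be written in vector form

$$\pi^t = \frac{1}{W}\, (S \pi^{t-1}) \mathbin{.*} l^t, \qquad (7)$$

where .* is the element-wise product, and the scaling factor
$W = \sum_i (S \pi^{t-1} \mathbin{.*} l^t)_i$ is necessary for
the probabilities to sum up to unity.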
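
One predict-update step of the vector form can be sketched in a few lines. This is an illustrative sketch, not the authors' C++ implementation: the sparse matrix $S$ is stored as a hypothetical dictionary of columns, the `threshold` parameter for "negligible prior" is an assumed knob, and the potential is passed as a callable so that it is evaluated only for cells that survive the pruning.

```python
def grid_filter_step(previous, s_columns, potential, threshold=0.0):
    """One update: pi_t = normalize((S pi_{t-1}) .* l_t).

    previous  : list of N cell probabilities pi_{t-1}
    s_columns : dict mapping column j to a list of (i, S_ij) non-zero entries
    potential : callable i -> phi_t(theta_i), evaluated lazily
    threshold : predicted mass at or below this value is treated as negligible
    """
    # Predict: sparse matrix/vector product S * pi_{t-1}.
    predicted = {}
    for j, p_j in enumerate(previous):
        if p_j <= 0.0:
            continue
        for i, s_ij in s_columns.get(j, ()):
            predicted[i] = predicted.get(i, 0.0) + s_ij * p_j

    # Update: evaluate the observation potential only where the prior matters.
    unnormalized = {i: potential(i) * v for i, v in predicted.items() if v > threshold}
    w = sum(unnormalized.values())
    if w == 0.0:
        raise ValueError("all cells have zero posterior mass")

    posterior = [0.0] * len(previous)
    for i, v in unnormalized.items():
        posterior[i] = v / w
    return posterior

# Toy example with three cells.
S = {0: [(0, 0.5), (1, 0.5)], 1: [(1, 0.5), (2, 0.5)], 2: [(2, 1.0)]}
l_t = [1.0, 3.0, 5.0]
pi_t = grid_filter_step([1.0, 0.0, 0.0], S, lambda i: l_t[i])
```

In the toy example the potential of cell 2 is never evaluated, because no prior mass reaches it; this is the saving that makes a grid with a very large number of cells tractable.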
Algorithm   CRP    CRPS   kNN   ICP   CND   ELMO
Seconds     0.05   0.07   0.5   0.1   120   8

Table 1: Average times required for algorithms tested to
process a single frame.


The final equation bears a striking resemblance to the standard
HMM update equations. It defines our online CONDITIONAL RANDOM
PERSON tracking algorithm (CRP). We can also use standard HMM
inference methods [20] to define a batch version of CRP: CRP
SMOOTHED (CRPS) uses a forward-backward algorithm to find the
pose distribution at every time step conditioned on all observed
images. In addition, the most likely pose sequence can be found
by using Viterbi decoding, and we call the resulting method CRP
VITERBI (CRPV).
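
As an illustration of the batch variant, the following is a textbook Viterbi decoder over the grid cells, working in the log domain with the same sparse column layout for $S$. It is a sketch under assumptions: the function name, the data layout, and the treatment of the first frame (an initial cell distribution multiplied by the first potential) are choices made here, not details given in the source.

```python
import math

NEG_INF = float("-inf")

def _log(x):
    return math.log(x) if x > 0.0 else NEG_INF

def viterbi_cells(initial, s_columns, potentials):
    """Most likely cell sequence.

    initial    : list of N initial cell probabilities
    s_columns  : dict mapping cell j to a list of (i, S_ij) non-zero entries
    potentials : list over time of lists [phi_t(theta_1), ..., phi_t(theta_N)]
    """
    n = len(initial)
    score = [_log(initial[i]) + _log(potentials[0][i]) for i in range(n)]
    back_pointers = []
    for phi in potentials[1:]:
        best_prev = [NEG_INF] * n
        back = [-1] * n
        for j in range(n):
            if score[j] == NEG_INF:
                continue
            for i, s_ij in s_columns.get(j, ()):
                candidate = score[j] + _log(s_ij)
                if candidate > best_prev[i]:
                    best_prev[i] = candidate
                    back[i] = j
        score = [best_prev[i] + _log(phi[i]) for i in range(n)]
        back_pointers.append(back)

    last = max(range(n), key=lambda i: score[i])
    if score[last] == NEG_INF:
        raise ValueError("no cell sequence has non-zero score")
    path = [last]
    for back in reversed(back_pointers):
        path.append(back[path[-1]])
    path.reverse()
    return path
```

Because only the non-zero entries of $S$ are visited, the cost per frame scales with the number of stored transitions rather than with $N^2$.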

5  Implementation  and  Evaluation 

We have implemented CRP and CRPS as described in the previous
sections. We used a database of 300,000 pose exemplars generated
from a large set of motion capture data in order to cover a
range of valid poses. The images are synthetic, and were
rendered, along with the foreground segmentation masks, in Poser
[7] for a fixed viewpoint. The motion-capture sequences are
available from [12] and include large body rotations, complex
motions, and self-occlusions. The transition matrix was computed
by locating 1000 nearest neighbors in joint position space for
each exemplar, and setting the probability of transitioning to
each neighbor to be Gaussian with $\sigma = 0.25$. The volume of
each cell was approximated as that of a ball with radius equal
to the median distance to 50 nearest neighbors.
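
The construction of the sparse transition structure can be sketched as below. Several details are assumptions made for illustration and are not stated in the source: a brute-force neighbor search is used for brevity, each exemplar counts itself among its own neighbors, each column is normalized to sum to one, and the cell-volume factor is left out.

```python
import math

def knn_transition_columns(points, k, sigma):
    """Sparse transition weights from each exemplar to its k nearest neighbors.

    points : list of equal-length coordinate tuples (joint positions)
    returns: dict mapping source index j to a list of (i, weight) pairs
    """
    columns = {}
    for j, p_j in enumerate(points):
        # Brute-force search; a spatial index would be needed for 300,000 exemplars.
        nearest = sorted((math.dist(p_j, p_i), i) for i, p_i in enumerate(points))[:k]
        weights = [(i, math.exp(-(d * d) / (2.0 * sigma * sigma))) for d, i in nearest]
        total = sum(w for _, w in weights)
        columns[j] = [(i, w / total) for i, w in weights]
    return columns

# Toy example: three one-dimensional "poses".
cols = knn_transition_columns([(0.0,), (0.1,), (5.0,)], k=2, sigma=0.25)
```

With 1000 neighbors per exemplar, the matrix has about 0.3% of its entries populated, which is what keeps the matrix/vector product in the filter cheap.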

We used the multiscale edge direction histogram (EDH) [23] as
the basic representation of images. The binary embedding $H$ is
obtained by thresholding individual bins in the EDH. It was
learned using a training set of 200,000 image pairs with similar
underlying poses (we followed an approach outlined in [29] for
estimating the false negative rate of a paired classifier
without explicitly sampling dissimilar pairs). This resulted in
2,575 binary dimensions.

The  tracking  algorithms  are  initialized  by  searching  for 
50  exemplars  in  the  database  closest  to  the  first  frame  in  the 
sequence  in  the  embedding  space. 

Due to the sizes of the database and the transition matrix, both
algorithms require large amounts of memory, so we performed our
tests on a computer with a 3.4GHz Pentium 4 processor and 2GB of
RAM. The algorithms were implemented in C++, and were able to
achieve real-time performance with average speeds of 20 frames
per second for CRP and 14 frames per second for CRPS.


5.1  Experiments  with  synthetic  data 

We have quantitatively evaluated the performance of our tracking
method on a set of motion sequences. These sequences were
obtained in the same way as the sequences used for training our
algorithm but were not included in the training set.

We compared our online algorithm, CRP, and its batch version
CRPS (CRP SMOOTHED), to four state-of-the-art pose estimation
algorithms. The first baseline was a stateless k-Nearest
Neighbors (kNN) algorithm that at every frame searches the whole
database for the 50 closest poses based on the embedding
distance. The remaining baseline methods were incremental
tracking algorithms: a deterministic gradient descent method
using the Iterative Closest Point (ICP) algorithm [8],
CONDENSATION [25], and ELMO [9]. The ICP algorithm directly
maximizes the likelihood function at every frame, whereas
CONDENSATION and ELMO evaluate the full posterior distribution.
In our experiments, the posterior distribution in CONDENSATION
was modeled using 2000 particles. In ELMO, the posterior
distribution was modeled using a mixture of 5 Gaussians. The
likelihood function defined in ICP, CONDENSATION and ELMO was
based on the Euclidean distance between the articulated model
and the 3D (reconstructed) points of the scene obtained from a
real-time stereo system. In contrast, both CRP and kNN
algorithms require only single-view intensity images and
foreground segmentation.

We have chosen to use the mean distance between estimated and
true joint positions as an error metric [3]. In Figure 2 we show
the performance of the six algorithms described above on four
synthetic sequences.⁴ As can be seen, both CRP and CRPS
consistently outperform kNN and CONDENSATION⁵, and compare
favorably to ICP. While CRP produces somewhat worse results than
ELMO, it does not use stereo data, and is a hundred and sixty
times faster. The timing information for all compared algorithms
is presented in Table 1.⁶
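
For concreteness, the per-frame error metric can be computed as in the following sketch. The function name and the representation of a pose as a list of joint coordinate tuples are assumptions made for illustration.

```python
import math

def mean_joint_error(estimated, truth):
    """Mean Euclidean distance between corresponding estimated and true joints."""
    if len(estimated) != len(truth) or not truth:
        raise ValueError("need two non-empty joint lists of equal length")
    return sum(math.dist(e, t) for e, t in zip(estimated, truth)) / len(truth)

# Two joints, off by 3 and 4 units respectively.
error = mean_joint_error([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
```

The result carries the units of the joint coordinates, which are centimeters in the tables and figures of this section.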

5.2  Statistical  analysis  of  results 

In order to evaluate the statistical significance of these
results we used the following methodology. We use the mean joint
position error as a measure of accuracy of pose prediction on a
given frame. Suppose that algorithms A and B are both tested on
a total of N frames, producing on frame i errors $e_i^A$ and
$e_i^B$, respectively. The quantity of interest is the error
difference $d_i^{A-B} = e_i^A - e_i^B$. Figure 3 shows

4 These results reflect correction of an error in an earlier CRPS implementation.

5 Increasing the number of particles used for Condensation should improve
performance, but the computational costs would become prohibitive.

6 We have used more iterations of gradient descent than the implementation
described in [9].
[Figure 2 plots. Panel titles: “Mean errors in joint positions on sequence:” applause2 (a), brushTeeth2 (b), dryer2 (c), salute-chest-azumi (d).]

Figure 2: Comparing algorithm performance on four synthetic sequences: “applause” (a), “brush teeth” (b), “dryer” (c), and
“salute” (d). The error is measured as an average distance between true and estimated joint positions. The graphs are best
viewed in color.


[Figure 3 histograms. Surviving panel labels, CRPS row: vs. CND: 73.7% / 9cm; vs. ELMO: 22.4% / -5cm.]

Figure 3: Distributions of improvements in joint position estimates of CRP (first row) and CRPS (second row) vs. kNN (first
column), ICP (second), CONDENSATION (third), and ELMO (fourth). Negative values along the x-axis mean lower error for
the proposed algorithm. Given for each comparison are the proportion of frames in which CRP/CRPS were better than the
alternative, and the average improvement in error. See text for results on statistical significance.


the distribution of these differences between our algorithms
(CRP and CRPS) and competing algorithms computed over a large
number of synthetic sequences. For example, the top right plot
shows the distribution of $d^{CRP-ELMO}$. A negative value of
$d^{A-B}$ means that on frame i algorithm A was better than
algorithm B. The lack of a parametric model for the distribution
of $d^{A-B}$ makes it difficult to apply thorough statistical
testing to hypotheses involving the mean of that distribution.
Therefore, the analysis below focuses on the median, which lends
itself more easily to non-parametric tests.

One question we can ask is whether the results support the
conclusion that A is expected to be better more than half the
time. We answer this question using the binomial sign test [24].
Intuitively, it is equivalent to modeling the outcome of each
comparison (on one frame) by a coin flip in which “tails” means
that the sign of $d^{A-B}$ is negative. The null hypothesis we
wish to reject is that the coin is fair. We applied this test to
the data histogrammed in Figure 3, using a p-value⁷ threshold of
p = 0.001. At this significance level, CRP was better than kNN
and CONDENSATION and worse than ELMO. CRPS was better than kNN,
CONDENSATION and ICP and worse than ELMO; we could not establish
significant differences in error of CRP vs. ICP.
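
A minimal sketch of a binomial sign test on per-frame error differences is given below. Two choices here are assumptions, since the source does not state them: the test is two-sided, and frames with a difference of exactly zero are dropped.

```python
from math import comb

def sign_test_p_value(differences):
    """Two-sided binomial sign test of the hypothesis that the signs are a fair coin."""
    negatives = sum(1 for d in differences if d < 0)
    positives = sum(1 for d in differences if d > 0)
    n = negatives + positives  # ties are ignored
    if n == 0:
        return 1.0
    k = min(negatives, positives)
    # Probability of k or fewer outcomes of the rarer sign under a fair coin.
    tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Ten frames in which algorithm A always has the lower error.
p_value = sign_test_p_value([-1.0] * 10)
```

In this toy case the p-value is 2/1024, which is above the 0.001 threshold used in the text; ten frames alone would not be enough to reject the null hypothesis at that level.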

A more refined statistical evaluation of the difference in
performance between two estimation algorithms is based on
establishing a confidence interval on the median improvement.
Given the desired confidence value p we seek a value D such that
the probability of the median difference of the errors being
above D is less than p.

We apply the following procedure to perform this test. Suppose
that D is the q-upper quantile of the observed distribution of
$d^{A-B}$, i.e., qN values are above D. Under the assumption
that the true median of the distribution lies below D, we have
$P_D = \Pr(d^{A-B} > D) < 1/2$. Now, we define a random variable
$Z_D$ that is the count of observed values of $d^{A-B}$ that
exceed D. Its distribution is binomial with parameters $P_D$ and
N. Consequently,

$$\Pr(Z_D \ge qN) = \mathrm{Bino}(Z_D;\, P_D, N) < \mathrm{Bino}(Z_D;\, 1/2, N), \qquad (8)$$

where $\mathrm{Bino}(x;\, p, n) = \binom{n}{x} p^x (1-p)^{n-x}$.
Using the De Moivre-Laplace approximation [19],

$$\mathrm{Bino}(Z_D;\, 1/2, N) \approx \mathcal{N}(Z_D;\, N/2, \sqrt{N}/2), \qquad (9)$$

where $\mathcal{N}(x;\, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp(-(x-\mu)^2 / 2\sigma^2)$.
Combining (8) and (9), and solving for the desired significance
p, we get

$$q = G^{-1}(1-p;\, N/2, \sqrt{N}/2) / N,$$

7 The p-value is the probability, under the null hypothesis, of obtaining
data at least as extreme as those observed; that hypothesis is rejected if the
p-value falls below a specified threshold, which determines the significance of
the test.


Method      kNN        ICP        CND        ELMO
CRP vs.     -1.5cm     0.04cm     -7.13cm    4.57cm
CRPS vs.    -1.75cm    -0.28cm    -7.47cm    4.27cm

Table 2: Confidence intervals for median error reduction,
with p = 0.001: with probability $1 - p$, the true median
of $d^{A-B}$ falls below the value for row A and column B.
Negative values indicate cases where we are confident with
respect to the improvement (error reduction) achieved by
CRP/CRPS over competing methods.


where $G^{-1}$ is the inverse of the normal (Gaussian)
cumulative distribution function. In other words, if we choose
the value of D corresponding to such q, the probability of the
true median being lower than D is at least $1 - p$. Results of
this test for p = 0.001 are given in Table 2: there is a robust
advantage to both CRP methods over kNN and Condensation, but not
over ELMO.
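
The quantile procedure can be sketched with the Python standard library, whose `statistics.NormalDist.inv_cdf` plays the role of $G^{-1}$. The function names are hypothetical, and rounding the count $qN$ up to an integer is an assumption made here.

```python
import math
from statistics import NormalDist

def upper_quantile_fraction(n_frames, p):
    """q = G^{-1}(1 - p; N/2, sqrt(N)/2) / N."""
    g = NormalDist(mu=n_frames / 2.0, sigma=math.sqrt(n_frames) / 2.0)
    return g.inv_cdf(1.0 - p) / n_frames

def median_upper_bound(differences, p):
    """Value D with about qN observed differences above it."""
    n = len(differences)
    count_above = math.ceil(upper_quantile_fraction(n, p) * n)
    if count_above >= n:
        raise ValueError("too few frames for the requested significance")
    ordered = sorted(differences, reverse=True)
    return ordered[count_above]

# With N = 10,000 frames and p = 0.001, q is only slightly above one half.
q = upper_quantile_fraction(10000, 0.001)
```

For large N the fraction q approaches 1/2, so the bound D approaches the sample median; with few frames the bound becomes correspondingly looser.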

A relatively large difference between estimated mean (Figure 3)
and median (Table 2) improvements of CRP variants over ICP can
be explained by the fact that ICP is more likely to completely
lose track (thus producing large errors) than CRP.

5.3  Experiments  with  real  data 

For the real data, segmentation masks were computed using color
background subtraction. Sample frames from a complicated real
image motion sequence are shown in Figure 4 (the video of the
original sequence and results of our algorithm are available as
supplementary material) and Figure 5. The top right pane in the
supplementary video was obtained by smoothing CRPV output and
rendering the resulting pose sequence.

6  Conclusions  and  Discussion 

We have presented CRP, an algorithm for tracking articulated
human motion in real-time. The main contributions of this work
are the discriminative CRF formulation of the tracking problem;
use of similarity preserving embedding for modeling the
observation potential function; and the grid-filter inference
algorithm that transforms the continuous density estimation
problem into a discrete one. The resulting algorithm is capable
of accurately tracking complicated motions in real-time (20 fps
in our experiments for both synthetic and real data).

As future work we are interested in using extra domain knowledge
to further improve the performance of the algorithm in two ways.
First, when the set of poses that need to be tracked is
restricted, then the size of the sample database
Figure 4: Sample frames from a gesture sequence (first row), segmentation masks (second row) and the corresponding frames
from the most likely sequence computed by the CRPV algorithm (third row). See supplementary material for the full video.


Figure 5: Sample frames from a gesture sequence (first row), segmentation masks (second row) and the corresponding frames
from the most likely sequence computed by the CRPV algorithm (third row).

can be decreased by removing all of the unnecessary poses.
Second, when the motion patterns are constrained, for example in
a dance, tracking can be made more robust by using a specialized
transition matrix (resulting, in the limit, in tracking on a
motion graph).